Electronic properties of DNA are believed to play a crucial role in many phenomena in living
organisms, for example the location of DNA lesions by base excision repair (BER) glycosylases
and the regulation of tumor-suppressor genes such as p53 by detection of oxidative damage.
However, the reproducible measurement and modelling of charge migration through DNA molecules
at the nanometer scale remains a challenging and controversial subject even after more than a
decade of intense efforts. Here we show, by analysing 162 disease-related genes from a variety
of medical databases with a total of almost 20,000 observed pathogenic mutations, a significant
difference in the electronic properties of the population of observed mutations compared to the
set of all possible mutations. Our results have implications for the role of the electronic
properties of DNA in cellular processes, and hint at the possibility
of prediction, early diagnosis and detection of mutation hotspots.

Cells tend to accumulate genetic changes over time, such as nucleotide substitutions, small
insertions and deletions, rearrangements of the genetic sequences and copy number changes. These
changes in turn affect protein-coding or regulatory components and lead to health issues such as
cancer, immunodeficiency, ageing-related diseases and other disorders. A cell responds to genetic
damage by initiating a repair process or programmed cell death. In recent years, a vast number
of detailed databases have been assembled in which rich information about the type, severity,
frequency and diagnosis of many thousands of such observed mutations has been stored.
This abundance of data is based on the now standard availability of massively parallel sequencing
technologies. Harvesting these genomic databases for new cancer genes and hence potential
therapeutic targets has already demonstrated its usefulness, and several recent international cancer
genome projects continue the required large-scale analysis of genes in tumours.

The possible relevance of charge transport in DNA damage has recently also attracted considerable
interest in the biochemical and biophysical literature. Direct measurement of charge
transport and/or transfer in DNA remains a highly controversial topic due to the very challenging
level of manipulation required at the nanoscale. Ab initio modelling of long DNA strands
is similarly demanding of computational resources, and so some of the most promising computational
approaches necessarily use much simplified models based on coarse-grained DNA. Here
we compute and datamine the results of charge transport calculations based on two such effective
models for each possible mutation in 162 of the most important disease-associated genes from four
large gene databases. The models are (i) the standard one-dimensional chain of coupled nucleic
bases with onsite ionisation potentials and (ii) a novel 2-leg ladder model with diagonal
couplings and explicit modelling of the sugar-phosphate backbone.
Results 

Point Mutations and Electronic Properties. We consider native genetic sequences and mutations
of disease-associated genes as retrieved from the Online Mendelian Inheritance in Man (OMIM)
database of NCBI, the Human Gene Mutation Database (HGMD), the International Agency for Research on
Cancer (IARC) as well as Retinoblastoma Genetics. We have selected these genes such that (i)
those from OMIM have a well-known sequence with known phenotype as well as at least 10 point
mutations, (ii) all other selected cancer-related genes also have at least 10 point mutations and (iii)
all non-cancer-related genes from HGMD have at least 200 point mutations (cp. the Supplementary
Table).

Many different types of mutation are possible in a genetic sequence, including point mutations,
deletion of single base pairs (producing a frame shift), and large-scale deletion or duplication
of multiple base pairs. Here, we restrict our attention to point mutations, as this allows us to directly
compare the sequence before and after the mutation. We study the magnitude of the change in
charge transport (CT) for pathogenic mutations when compared to all possible mutations, either locally,
i.e. at the given hotspot site, or globally when ranked according to magnitude of CT change.
We find that the vast majority of mutations show good agreement with a hypothesis in which the
smallest change in electronic properties, as measured by a change in CT, corresponds to a mutation
that has appeared in one of the aforementioned databases of pathogenic genes.

A gene with $\mathcal{N}$ base pairs (bps) has a native nucleotide sequence $(s_1, s_2, \ldots, s_{\mathcal{N}})$ along the
coding strand. The gene has a total of $3\mathcal{N}$ possible point mutations, which we denote as the set
$M_{\rm all}$, of which a subset $M_{\rm pa}$ are known pathogenic mutations. A point mutation is represented by
the pair $(k, s)$, where $k$ is the position of the point mutation in the genomic sequence and $s$ is the
mutant nucleotide which replaces the native nucleotide. We shall write a mutation from a native
base P to a mutant base Q as "Pq". We note that there are a total of twelve possible point mutations
in a DNA sequence (from any one of four bases to any one of three alternatives). Of these twelve,
four are transitions, in which a purine base replaces a purine or a pyrimidine replaces a pyrimidine,
and eight are transversions, in which a purine is replaced by a pyrimidine or vice versa. Biologically,
transitions are in general much more common than transversions. Indeed, the set of observed
pathogenic mutations for our 162 genes contains 10999 transitions and 8883 transversions, whereas
in the set of all mutations their ratio is by definition 1 : 2. The observed pathogenic mutations are
thus already a biased selection from the set of possible mutations, favouring transitions. However,
this local onsite chemical shift is not sufficient to fully explain our data, as we will show later.
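
The 4 : 8 split between transitions and transversions follows from enumerating the twelve ordered
base pairs. The following is a minimal Python sketch of that bookkeeping; the function name
`mutation_class` is introduced here for illustration only and is not part of the original analysis.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def mutation_class(native: str, mutant: str) -> str:
    """Classify the point mutation native -> mutant as 'transition' or 'transversion'."""
    if native == mutant:
        raise ValueError("a point mutation must change the base")
    pair = {native, mutant}
    same_family = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_family else "transversion"

# Enumerate all twelve ordered (native, mutant) pairs, written "Pq" in the text.
counts = {"transition": 0, "transversion": 0}
for native, mutant in permutations("ACGT", 2):
    counts[mutation_class(native, mutant)] += 1

print(counts)  # {'transition': 4, 'transversion': 8}

# Ratio transitions : transversions among the observed pathogenic mutations
# (counts quoted in the text) versus the 1 : 2 baseline for all mutations.
observed_ratio = 10999 / 8883
baseline_ratio = counts["transition"] / counts["transversion"]
```

With the counts quoted above, the observed ratio is about 1.24, compared with the baseline of 0.5.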

We compute and datamine the results of quantum mechanical transport calculations based
on two effective Hückel models for each possible mutation in those 162 genes. Both models
assume $\pi$-$\pi$ orbital overlap in a well-stacked double helix. The parameters are chosen to represent
hole transport. Using the transfer matrix method, we calculate the spatial extent of (hole)
wavefunctions of a given energy on a length of DNA with a given genetic sequence. Wavefunction
localisation is directly related to conductance and we therefore find it convenient to report our
results in terms of conductance.

To determine the effect of a mutation, we consider sub-sequences of length $L$ bps; there
are $L$ such sequences that include a given site $k$. For all $L$ sequences we calculate quantum-mechanical
charge transmission coefficients $T$ (in units of $e^2/h$, averaged across a range of incident
energies, as detailed in Methods) for the native and mutant sequences. We describe the effect of
the mutation on the electronic properties of the DNA strand near to the mutation site using the
mean square difference, $\Gamma = \langle |T_{\rm native} - T_{\rm mutant}|^2 \rangle$, averaged across all $L$ sequences. Larger values
of $\Gamma$ therefore correspond to a greater difference in electronic structure between the native and
mutant sequences. The length $L$ must be long enough to allow for substantial delocalisation across
multiple base pairs but should remain below the typical persistence length of $\sim 150$ bp, such
that any overlap or crossing by packing, e.g. by wrapping around histone complexes in chromatin,
can be ignored. In this study we have considered lengths of 20, 40 and 60 bps. This requires, for each
of the $\mathcal{N}$ sites in a gene, $L$ calculations for each sequence of length $L$ and for each of 4 possible
bases at that site; which, for the more than $11 \times 10^6$ bases in our dataset of 162 genes, is more than
$5 \times 10^9$ quantum mechanical transport calculations.
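
The counting argument can be made explicit with a short Python sketch. The helper names
`window_starts` and `n_calculations` are hypothetical, and the sketch assumes that windows near
the ends of a gene are treated like all others, which the text does not specify.

```python
def window_starts(k: int, L: int) -> list[int]:
    """Start positions j = k-L+1, ..., k of the L windows of length L that contain site k."""
    return list(range(k - L + 1, k + 1))

def n_calculations(n_sites: int, lengths=(20, 40, 60), n_bases: int = 4) -> int:
    """Number of transport calculations: every site, every window containing it,
    and every possible base at that site, summed over the window lengths used."""
    return sum(n_sites * L * n_bases for L in lengths)

starts = window_starts(k=100, L=20)   # 20 windows, starting at 81, ..., 100
total = n_calculations(11_000_000)    # order-of-magnitude estimate for the dataset
```

For 11 million sites this gives 5.28 billion calculations, in line with the estimate quoted above.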

Local and global ranking. We first compare $\Gamma$ of each observed pathogenic mutation with the
other two non-pathogenic ones at the same position and determine a local ranking (LR) of CT
change. There are three possibilities of LR, namely low, medium and high. Note that those hotspots
with more than one pathogenic mutation are excluded from the LR analysis. We have also sorted
the LR ranking for each gene according to prevalence in Fig. 1(a+b). We find that for $L = 20$, 40
and 60 low CT change is found for 155 (95%), 148 (91%) and 140 (86%) of all 162 genes
with pathogenic mutations. Examples of LR for the pathogenic mutations of p16 and CYP21A2
are shown in Supplementary Fig. S3. We graphically summarise the results for all 162 disease-associated
genes in Fig. S5. For each gene, we have shown a positive deviation from the 33%
line in orange, supporting the scenario of small CT change for pathogenic mutations, and in
blue when the results show no or a negative indication with CT change. It is clear that the
correlation between low CT change and mutation hotspots is well pronounced.
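
The local ranking can be illustrated by a small Python sketch. The $\Gamma$ values below are made up
for illustration, the function names are hypothetical, and ties between $\Gamma$ values are not
handled specially.

```python
from collections import Counter

RANKS = ("low", "medium", "high")

def local_rank(gamma_by_mutant: dict, pathogenic: str) -> str:
    """Rank the pathogenic mutation among the three possible mutations at one site,
    ordered by increasing CT change Gamma."""
    if len(gamma_by_mutant) != 3 or pathogenic not in gamma_by_mutant:
        raise ValueError("need Gamma for exactly three mutant bases, one of them pathogenic")
    ordered = sorted(gamma_by_mutant, key=gamma_by_mutant.get)
    return RANKS[ordered.index(pathogenic)]

def rank_fractions(sites) -> dict:
    """Fractions of low/medium/high rankings over sites with a single pathogenic mutation."""
    counts = Counter(local_rank(gammas, base) for gammas, base in sites)
    total = sum(counts.values())
    return {rank: counts[rank] / total for rank in RANKS}

# Illustrative (invented) Gamma values: (Gamma per mutant base, pathogenic mutant base).
sites = [
    ({"A": 0.02, "G": 0.30, "T": 0.11}, "A"),
    ({"C": 0.05, "G": 0.01, "T": 0.40}, "C"),
    ({"A": 0.08, "C": 0.50, "T": 0.09}, "A"),
    ({"A": 0.20, "C": 0.07, "G": 0.90}, "G"),
]
fractions = rank_fractions(sites)
```

A gene supports the small-CT-change scenario when its "low" fraction exceeds the 1/3 expected by chance.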

We can also consider a global ranking (GR) by sorting the CT change $\Gamma$ for all $3\mathcal{N}$ possible
mutations of a gene with $\mathcal{N}$ bps in order to get a ranking of every observed pathogenic mutation.
By dividing each ranking by $3\mathcal{N}$ we compute the normalised GR $\gamma$ of the mutation, with values
between 0 and 1. Smaller values of $\gamma$ mean smaller CT change. By analogy to the local ranking,
we divide the $\gamma$ of the pathogenic mutations into three groups as before, i.e. low ($\gamma < 33.3\%$),
medium ($33.3\% < \gamma < 66.7\%$), and high ($\gamma > 66.7\%$) CT change. The results of the GR for the
162 genes are shown in the bottom row (c) and (d) of Fig. 1. As for the LR results, we observe
many $\gamma$ values with low CT change (cp. Supplementary Figs. S3 and S4). Hence the LR and GR
results consistently show that observed pathogenic mutations are generally biased towards smaller
change in CT than the set of all possible mutations (cp. Supplementary Figs. S5 and S6).
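
A minimal Python sketch of the normalised global ranking follows. It assumes that the rank of a
mutation is the number of possible mutations with strictly smaller $\Gamma$ (so that ranks start at
zero); the exact rank convention of the original analysis is not stated. Function names are hypothetical.

```python
from bisect import bisect_left

def normalised_global_rank(all_gammas, gamma: float) -> float:
    """Fraction of all 3N possible mutations whose Gamma is smaller than `gamma`."""
    ordered = sorted(all_gammas)
    return bisect_left(ordered, gamma) / len(ordered)

def rank_group(g: float) -> str:
    """Assign a normalised rank to the low / medium / high group."""
    if g < 1 / 3:
        return "low"
    if g < 2 / 3:
        return "medium"
    return "high"

# Toy example: 30 possible mutations with Gamma values 0, 1, ..., 29.
all_gammas = list(range(30))
groups = [rank_group(normalised_global_rank(all_gammas, g)) for g in (3, 15, 29)]
```

In this toy example the three mutations fall into the low, medium and high groups respectively.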

Distributions of change in charge transport. In Figure 2 we show as an example results for the
distribution of $\Gamma$ for the p16 DNA strand for both the 1D and 2-leg models. In panels (a+b), it is clear
that the 111 observed pathogenic mutations of p16 have on average smaller changes in the CT
properties as compared to all 80220 possible mutations, for both the 1D and 2-leg models. We find
that results for the vast majority of the other 161 genes are quite similar. The distributions of $\Gamma$
values in Fig. 2(a+b) are approximately log-normal. We therefore calculate, for each of the 162
genes in our dataset, an average $\log \Gamma$ value for the distributions of all and pathogenic mutations.
Histograms of the distributions of these $\langle \log \Gamma \rangle$ values are shown in Fig. 2(c+d). It is once again
clear that the distributions for observed pathogenic mutations are shifted towards lower $\Gamma$ values
in both the 1D and the 2-leg models.

We next define a global CT shift for a gene $g$ as $\Lambda_g = \langle \log \Gamma \rangle_{g,{\rm all}} - \langle \log \Gamma \rangle_{g,{\rm pa}}$. Positive
values of $\Lambda_g$ indicate that the observed pathogenic mutations of gene $g$ have a lower average $\Gamma$.
For each of our 162 genes we obtain the distribution of $\Lambda_g$ for the 1D and 2-leg models as shown in
Figs. 2(e+f). We can define, for the whole set of 162 genes, an average global shift
$\langle \Lambda \rangle = \sum_g \Lambda_g / 162$, weighting all genes equally; we can also weight the results by the number
of observed pathogenic mutations for each gene, $|M_{\rm pa}|_g$, for a weighted average global shift
$\langle \Lambda \rangle_w = \sum_g |M_{\rm pa}|_g \Lambda_g / \sum_g |M_{\rm pa}|_g$.

These values are also indicated in Figs. 2(e+f) and in both models there is a tendency towards
positive shifts, i.e. lower average $\Gamma$ for observed pathogenic mutations.
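
The shift statistics can be sketched in a few lines of Python. This is an illustration only: it
assumes base-10 logarithms (as used in Table 1) and normalises the weighted average by the total
number of pathogenic mutations; the function names are hypothetical.

```python
import math
from statistics import mean

def global_shift(gammas_all, gammas_pa) -> float:
    """Shift for one gene: <log Gamma> over all mutations minus <log Gamma> over pathogenic ones."""
    return mean(math.log10(g) for g in gammas_all) - mean(math.log10(g) for g in gammas_pa)

def average_shifts(shifts, n_pathogenic):
    """Unweighted average shift and average weighted by the number of pathogenic mutations per gene."""
    plain = sum(shifts) / len(shifts)
    weighted = sum(n * s for n, s in zip(n_pathogenic, shifts)) / sum(n_pathogenic)
    return plain, weighted

# Toy example: pathogenic mutations with smaller Gamma give a positive shift.
shift = global_shift(gammas_all=[1.0, 100.0], gammas_pa=[0.1])
plain, weighted = average_shifts(shifts=[1.0, 3.0], n_pathogenic=[1, 3])
```

In the toy example the shift is 2, and the weighted average (2.5) exceeds the unweighted one (2.0)
because the gene with more pathogenic mutations has the larger shift.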

Transitions and transversions. In our models we would expect transitions to cause, in general, a
smaller change in CT than transversions, as the change in onsite energy and in transfer coefficients
is smaller for a transition than a transversion. However, as we will demonstrate here, the increased
proportion of transitions among the observed pathogenic mutations is not sufficient to account for
the distributions seen in Fig. 2.

In Fig. 3(a+b) we show the distribution of $\Gamma$ values for our entire dataset of all $\sim 34 \times 10^6$
possible mutations and 19882 known pathogenic mutations, dividing the datasets into transitions
and transversions. For both models, the transitions are shifted to slightly lower $\Gamma$ values than the
transversions. However, in the 2-leg model, the distribution for observed pathogenic transitions
appears co-located with the distribution for all transitions, and likewise for transversions. In the
1D model, by contrast, the observed pathogenic transitions are visibly shifted to lower $\Gamma$ values
than the set of all transitions, and the same is true for transversions.

In Fig. 3(c+d) we represent the distributions of $\Gamma$ values for each of the twelve types of point
mutation by points for the mean values of $\log \Gamma$ and bars indicating the standard deviation of the
distribution of $\log \Gamma$. In the 2-leg model, the distributions for observed pathogenic mutations are
essentially coincident with the distributions for all mutations for each type Pq. The positive $\langle \Lambda \rangle$
and $\langle \Lambda \rangle_w$ shift results in the 2-leg model are thus accounted for by the set of observed pathogenic
mutations being biased towards transitions. The 1D model displays a quite different behaviour:
in each case the mean of the distribution for the observed pathogenic mutations of any type Pq
lies from 7.5 to 20 standard errors below the mean for all possible mutations of type Pq. Hence
the probability that the observed pathogenic mutations are a random subset of all mutations, with
respect to their electronic properties in the 1D model, is comparable to the probability of drawing
twelve values more than 7.5 standard deviations below the mean from a normal distribution, which
is less than $10^{-160}$. The difference in CT change between observed pathogenic and
all possible mutations is thus statistically highly significant irrespective of whether transitions or
transversions are involved. In the 2-leg model, by contrast, the means of the $\log \Gamma$ distributions for
observed pathogenic mutations can lie either above or below those for all mutations for different
types Pq, and the difference in the means, between 0.03 and 5.5 standard errors, is much
smaller.



Let us also consider, for each gene $g$, simulation length $L$ and each mutation type Pq, whether
the subset shift $\lambda = \left( \langle \log \Gamma_{\rm all} \rangle - \langle \log \Gamma_{\rm pa} \rangle \right)_{g,L,{\rm Pq}}$ is positive or negative. This gives us, for each
model, $162 \times 3 \times 12 = 5832$ data points, less 1029 cases where no calculation is possible as no
pathogenic mutations of type Pq are known for gene $g$. These $\lambda$ data are presented in Fig. 4. In
the 2-leg model there are approximately equal numbers of negative and positive $\lambda$ values. This
is consistent with a null hypothesis where the observed pathogenic mutations of a type Pq have
the same distribution of $\Gamma$ values as for all mutations of that type. In the 1D model, by contrast,
such a null hypothesis is decisively rejected: there is a preponderance of positive $\lambda$ values by
almost 2 : 1 (3326 positive to 1513 negative) and the binomial probability of obtaining such a
result at random would be approximately $10^{-150}$. The two analyses agree that observed pathogenic
mutations display a significant bias towards smaller changes in electronic properties in the 1D
model.
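
The orders of magnitude of the two significance estimates above can be checked independently.
The sketch below uses only the Python standard library; it evaluates a one-sided normal tail
raised to the twelfth power and a one-sided binomial tail with success probability 1/2. Whether
the original estimates were one- or two-sided is not stated, so this is an assumption, and the
function names are hypothetical.

```python
import math

def log10_normal_lower_tail(z: float) -> float:
    """log10 of P(Z < -z) for a standard normal variable Z."""
    return math.log10(0.5 * math.erfc(z / math.sqrt(2.0)))

def log10_binomial_upper_tail(k: int, n: int) -> float:
    """log10 of P(X >= k) for X ~ Binomial(n, 1/2), summed in log space to avoid underflow."""
    log_terms = [
        math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1) - n * math.log(2.0)
        for i in range(k, n + 1)
    ]
    top = max(log_terms)
    log_sum = top + math.log(sum(math.exp(t - top) for t in log_terms))
    return log_sum / math.log(10.0)

# Twelve independent draws, each more than 7.5 standard deviations below the mean.
twelve_draws = 12 * log10_normal_lower_tail(7.5)

# Sign test: 3326 positive out of 3326 + 1513 non-zero subset shifts.
sign_test = log10_binomial_upper_tail(3326, 3326 + 1513)
```

Both exponents come out far below any conventional significance threshold, so the conclusion does
not depend on the precise tail convention.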

Discussion 

Our CT models act as probes of the statistics of the DNA sequence. It is possible that we are merely
observing a correlation, i.e. that mutations are more likely to occur in areas of the genome with
certain statistical properties, for reasons not causally related to charge transport, and that these
properties correlate with biased CT properties in our 1D model. Such a correlation between quantum
transport and mutation hotspots would in itself be a valuable and novel observation in bioinformatics.
There are known chemical biases in the occurrence of mutations, such as the enhanced
transition rate in C-G doublets, the bias towards GC base pairs rather than AT pairs in biased
gene conversion, and the tendency of holes to localise on GG and GGG sequences and there
cause oxidative damage. However, since our observed bias is consistent across all twelve types
of point mutation, these known biases cannot fully account for our data.

There are also plausible causal connections between our data and cellular genetic processes
where the electronic properties of DNA may be significant. One such process is gene regulation,
where charge transport along the DNA strand can couple to redox processes in DNA-bound
proteins, inducing protein conformational change and unbinding. Similarly, it has been proposed
that DNA repair glycosylases containing redox-active [4Fe-4S] clusters may localise to the site of
DNA lesions through a DNA-mediated charge transport mechanism. The recognition of specific
areas in the DNA sequence by DNA-binding proteins generally may involve electrostatic recognition
of the target DNA sequence. Furthermore, homologous recombination, a process which
is vital to the repair of double-strand breaks (a most serious DNA lesion) and also to genetic
recombination, relies on the mutual recognition of homologous chromosomes before strand invasion
can occur. Homologous double-stranded DNA sequences are capable of mutual recognition
even in a protein-free environment, presumably via electronic or electrostatic interactions.

All the above processes, especially those involving protein-DNA or DNA-DNA recognition,
would be less disrupted by a smaller change in the electronic environment along the coding strand.
From this point of view, the observed mutations are biased to cause less disruption to gene regulation
and DNA damage repair in the cell. This may seem counterintuitive at first. However, in order
for a mutation to appear in our dataset of pathogenic mutations, the cell and the organism must
develop viably for long enough for a mutant phenotype to be observed. Mutations which cause
large disruptions to DNA regulation and repair are more likely to be lethal to the cell at an early
stage and will thus be absent from disease databases. Similarly, mutations which are more visible
to DNA repair mechanisms are less likely to persist and to appear in databases.

Genetic repair and regulation mechanisms cannot know whether the consequences of a mutation
are beneficial, neutral or harmful. We would therefore predict that neutral mutations should
display the same bias, towards smaller change in electronic structure, as we observe in the pathogenic
mutations. As a first test of this prediction, we have considered the case of the TP53 gene, with
20303 base pairs and for which 2003 pathogenic mutations, 366 silent mutations
and 113 intronic mutations are known. We have simulated these silent and intronic mutations using the 1D
model. Histograms of the distribution of $\Gamma$ values for these mutations are given in the supplementary
material, see Fig. S7. In Table 1 we analyze the statistical properties of the resulting $\Gamma$ distributions;
our results demonstrate that, for both transitions and transversions, the silent and intronic
mutations are similar to the pathogenic mutations and significantly dissimilar to the population of
all possible mutations, as predicted.

Methods



Models of charge transport in DNA. The simplest model of coherent hole transport in DNA
is given by an effective one-dimensional Hückel Hamiltonian for CT through nucleotide HOMO
states, where each lattice point represents a nucleotide base (A, T, C, G) of the chain for
$n = 1, \ldots, N$. In this tight-binding formalism, the on-site potentials $\epsilon_n$ are given by the ionisation
potentials $\epsilon_G = 7.75$ eV, $\epsilon_C = 8.87$ eV, $\epsilon_A = 8.24$ eV and $\epsilon_T = 9.14$ eV at the $n$th site (cp. Fig. 5); the
hopping integrals $t_{n,n+1}$ are assumed to be nucleotide-independent with $t_{n,n+1} = 0.4$ eV. A model
which is less coarse-grained is provided by the diagonal, 2-leg ladder model shown in Fig. 5. Both
strands of DNA and the backbone are modelled explicitly and the different diagonal overlaps of
the larger purines (A, G) and the smaller pyrimidines (C, T) are taken into account by suitable
inter-strand couplings. The intra-strand couplings are 0.35 eV between identical bases and 0.17 eV
between different bases; the diagonal inter-strand couplings are 0.1 eV for purine-purine, 0.01 eV
for purine-pyrimidine and 0.001 eV for pyrimidine-pyrimidine. Perpendicular couplings to the
backbone sites are 0.7 eV, and perpendicular hopping across the hydrogen bond in a base pair is
reduced to 0.005 eV.
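
To make the 1D model concrete, the following Python sketch evaluates the transmission of a
tight-binding chain by iterating the site-to-site recursion that underlies the transfer matrix
method. It is an illustration under stated assumptions, not the authors' implementation: the leads
are taken to be uniform chains with on-site energy 7.75 eV and the same hopping as the DNA chain,
which need not match the contacts or energy window used in the paper, and the function name
`transmission_1d` is hypothetical.

```python
import cmath
import math

# On-site (ionisation) energies in eV and uniform hopping, as quoted in the text.
ONSITE_EV = {"G": 7.75, "C": 8.87, "A": 8.24, "T": 9.14}
HOPPING_EV = 0.4

def transmission_1d(sequence: str, energy: float,
                    lead_onsite: float = 7.75, hopping: float = HOPPING_EV) -> float:
    """Transmission T(E) of a 1D tight-binding chain between two ideal leads.

    Assumption of this sketch: both leads are uniform chains with on-site energy
    `lead_onsite` and hopping `hopping`, so E = lead_onsite + 2 * hopping * cos(k).
    """
    cos_k = (energy - lead_onsite) / (2.0 * hopping)
    if abs(cos_k) >= 1.0:
        return 0.0  # outside the lead band there are no propagating states
    k = math.acos(cos_k)
    # Impose a single outgoing wave of unit amplitude in the right lead:
    # psi[N+1] = 1 and psi[N+2] = exp(ik).
    nxt, cur = cmath.exp(1j * k), 1.0 + 0.0j
    # Recurse leftwards with psi[n-1] = ((E - eps_n) / t) * psi[n] - psi[n+1],
    # passing one lead site, the N DNA sites, and one more lead site.
    onsite = [lead_onsite] + [ONSITE_EV[b] for b in reversed(sequence)] + [lead_onsite]
    for eps in onsite:
        nxt, cur = cur, (energy - eps) / hopping * cur - nxt
    # Now nxt = psi[0] and cur = psi[-1]; project onto the wave travelling with exp(ikn).
    amplitude = (nxt * cmath.exp(1j * k) - cur) / (2j * math.sin(k))
    return 1.0 / abs(amplitude) ** 2

t_uniform = transmission_1d("GGGGGGGG", 7.9)   # chain identical to the leads
t_mixed = transmission_1d("GATTACA", 8.0)      # disordered chain scatters the wave
```

A chain identical to the leads transmits perfectly, while any sequence disorder reduces the
transmission. For long, strongly localised sequences the recursion amplitudes grow exponentially,
so a production code would normalise them; for the window lengths of 20 to 60 bps considered here
double precision is sufficient.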

The 2-leg model allows inter-strand coupling between the purine bases in successive base
pairs, in accordance with electronic structure calculations, and should therefore be a better model
for bulk charge transport along the DNA double helix; the 1D model, by contrast, makes use of
the site energies of only the bases on the coding strand, and so is most representative of the
electronic environment along that strand. We also find that the 2-leg model recovers some of the
coding strand dependence of the 1D model upon decreasing the diagonal hoppings. For 28 genes,
we find that reducing only the diagonal hopping elements by a factor of two leads to a much greater
agreement with the 1D results, similar to Fig. 1(c).

Calculation of quantum transmission coefficients. The quantum transmission coefficient $T(E)$
for a DNA sequence of length $N$ bps at injection energy $E$ can be calculated for
both models by using the transfer matrix method. Let us define $T_{j,L}(E)$ as the transmission
coefficient for a part of a given DNA sequence which starts at base pair position $j$ and is $L$ base
pairs long. The position-dependent averaged transmission coefficient at the $k$th base pair for
transmission length $L$ bps is defined as

$$T_L^{(k)} = \frac{1}{L} \sum_{j=k-L+1}^{k} \frac{1}{E_1 - E_0} \int_{E_0}^{E_1} T_{j,L}(E) \, dE .$$

Here $j$ ranges from $k - L + 1$ to $k$ such that each subsequence of length $L$ contains the $k$th base
pair. $E_0$ and $E_1$ are the lower and upper bounds of the incident energy of the carriers, e.g. for
the 1D model used here, the values are 5.75 and 9.75 eV, respectively; for the 2-leg model the
bounds are 7 and 11 eV. We have used an energy resolution of $\Delta E = 0.005$ eV. Then we examine
the difference between transmission coefficients of the normal and mutated genomic sequence of
a point mutation and hence denote by $T_{j,L}^{(k,s)}$ the transmission coefficient of the same segment of
DNA as $T_{j,L}$ but with the point mutation $(k, s)$. $\Gamma_L^{(k,s)}$ is the averaged effect of the point mutation
$(k, s)$ on CT properties for all subsequences of length $L$ containing the mutation.
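
The averaging over windows and energies can be sketched in Python as follows. The sketch takes
the transmission function as a parameter, so any model can be plugged in; the toy transmission
used in the example is a stand-in chosen purely to make the arithmetic checkable and has no
physical meaning. Positions are 0-based, the energy integral is replaced by a mean over a uniform
grid, and windows that would extend past the ends of the sequence are skipped; all three choices,
like the function name `gamma_change`, are assumptions of this illustration.

```python
from typing import Callable

def gamma_change(sequence: str, k: int, s: str, L: int,
                 transmission: Callable[[str, float], float],
                 e0: float, e1: float, de: float = 0.005, q: int = 2) -> float:
    """Average of |T_native - T_mutant|^q over the energy grid and over all
    length-L windows of `sequence` that contain the (0-based) site k."""
    mutant = sequence[:k] + s + sequence[k + 1:]
    n_e = int(round((e1 - e0) / de)) + 1
    energies = [e0 + i * de for i in range(n_e)]
    total, n_windows = 0.0, 0
    for j in range(k - L + 1, k + 1):          # window start positions
        if j < 0 or j + L > len(sequence):
            continue                           # skip windows leaving the sequence
        native_win, mutant_win = sequence[j:j + L], mutant[j:j + L]
        diff = sum(abs(transmission(native_win, e) - transmission(mutant_win, e)) ** q
                   for e in energies)
        total += diff / n_e                    # energy average for this window
        n_windows += 1
    if n_windows == 0:
        raise ValueError("no complete window of length L contains site k")
    return total / n_windows

# Toy stand-in for T(E): the fraction of G bases in the window (not a physical model).
def toy_transmission(window: str, energy: float) -> float:
    return window.count("G") / len(window)

gamma = gamma_change("GGGGGG", k=2, s="A", L=3,
                     transmission=toy_transmission, e0=5.75, e1=9.75)
```

In the toy example every window loses one of its three G bases, so each contributes
$(1/3)^2 = 1/9$ and the average is 1/9. Replacing `toy_transmission` by a transfer-matrix
calculation of $T_{j,L}(E)$ gives the quantity described in the text.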
Table 1: Mean logarithm of CT change $\Gamma$ for gene TP53 using the 1D model with $L = 20$. Data are divided
into transitions and transversions. We give standard errors of the mean (SEM) and standard deviations ($\sigma$)
for each distribution. From these we estimate the probability of each distribution being a random sample
from the set of all mutations, $p_{\rm all}$, or being a sample from a population similar to the pathogenic mutations,
$p_{\rm pa}$ (cp. Fig. S7). There are 224 silent transitions and 142 silent transversions; 67 intronic transitions and
46 intronic transversions. The pathogenic mutations and all possible mutations outnumber the silent and
intronic populations by factors of 10-1000 and so it is the SEM for the smaller populations that is significant.
It is clear that the mean CT change $\langle \log_{10} \Gamma \rangle$ for the silent and intronic populations is far more similar to the
pathogenic populations than to the entire population of all possible mutations. This is true for both transitions
and transversions, although the p-value for the intronic transitions is not statistically significant (i.e. > 0.05),
which we attribute to the small number of available intronic data.



[Figure 1 panels: prevalence of low, medium and high CT change for L = 20, 40 and 60, plotted against ordered genes for each L.]

Figure 1: Sorted prevalence of the low, medium and high CT change among local (a+b) and global
(c+d) rankings for pathogenic mutations in 162 genes using the 1D (a+c) and the 2-leg (b+d)
models. Results are consistent for all three lengths $L = 20, 40, 60$. The 1/3 value expected by
chance is shown as a dashed horizontal line. Low rankings are dramatically more prevalent locally
and globally than chance would suggest.

Figure 2: (a+b) Distribution of the change in charge transport $\Gamma$ for pathogenic (orange bars) and
all possible (cyan bars) mutations for the p16 (CDKN2A) gene with 26740 base pairs and 111
known pathogenic mutations. (c+d) Distribution of the average (logarithmic) change in charge
transport $\langle \log \Gamma \rangle$ for all pathogenic (orange bars) and all possible (cyan bars) mutations for all
162 genes. (e+f) Distribution of the global shift $\Lambda_g$ values for all genes, showing a consistent
tendency to positive values. The average $\langle \Lambda \rangle$ (dashed) and weighted average $\langle \Lambda \rangle_w$ (dash-dotted) values
are indicated by vertical lines, as is a dotted reference line. The grey bars denote the error of the
mean for $\langle \Lambda \rangle$. The results for the 1D and 2-leg models are displayed in panels (a,c,e) and (b,d,f),
respectively.

Figure 3: Distributions of $\Gamma$ for the 1D (a) and 2-leg (b) models for all genes, with mutations
divided into transitions and transversions. The distributions are normalised by the size of the
mutation dataset. Lines are guides to the eye only. The means (symbols) and standard deviations
(error bars) of the distributions of $\log \Gamma$ are shown in panels (c) and (d) for the 1D and 2-leg models.
Estimated errors of the means are smaller than the symbols. Distributions are shown for transition
(Ti) and transversion (Tv) mutations, and for the twelve types of point mutation individually. Open
symbols (blue, cyan) are for the set of all mutations, filled symbols (orange, red) for the set of
pathogenic mutations.

Figure 5: Schematic models for charge transport in DNA. The nucleobases are given as circles
(red, denoting pairs) and ellipses (blue, brown for single nucleotides). Electronic pathways are
shown as solid lines of varying thickness to indicate variation in strength. Model (a) indicates the
1D model where the sugar-phosphate backbone is ignored. In model (b), brown circles denote
the smaller pyrimidines, blue ellipses are the large purines and green circles denote the
sugar-phosphate backbone sites. Note that diagonal hopping between purines is favoured, and between
pyrimidines disfavoured, by the larger size of the purines.

Supplementary Material



Comparing the Averaged Electronic Properties for the Pathogenic and Non-pathogenic Mutations
for Each Gene

We denote the genomic sequence of a gene with length $\mathcal{N}$ base pairs (bps) as $(s_1, s_2, \ldots, s_{\mathcal{N}})$.
Each point mutation of a given gene is characterized by the pair $(k, s)$, where $k$ and $s$ are the
position of the point mutation in the genomic sequence and the mutant nucleotide which replaces
the nucleotide $s_k$ of normal DNA, respectively. There are in total $3\mathcal{N}$ possible point mutations of a
gene with $\mathcal{N}$ bps. The sets of these $3\mathcal{N}$ mutations and the pathogenic mutations for the gene are
denoted as $M_{\rm all}$ and $M_{\rm pa}$, respectively. $M_{\rm pa}$ is a subset of $M_{\rm all}$. For every possible point mutation,
we compute the mean quantum mechanical transmission coefficient $T_L^{(k)}$ of a subsequence with
length $L$ of the wild-type gene. Here the mean is determined by averaging over all individual
transmission coefficients $T_{j,L}$ with $j = k - L + 1, k - L + 2, \ldots, k$. In this way, the influence
of the full neighborhood of hotspot $k$ is taken into account and not just the mutation itself. The
results of $T_L^{(k)}$ for $k \in M_{\rm pa}$ already show some signatures of atypical CT response for the 1D
model. However, the signal is much less pronounced in the 2-leg model. Hence we study the
difference in CT between a healthy DNA base and the 3 possible mutations. For example, the
hotspot 14585 of p53 contains the correct C/G base pair in the wild type, but of the three possible
mutations C/G $\to$ G/C, C/G $\to$ A/T and C/G $\to$ T/A only the last one is known to lead to
cancer. Averaging again over all incident energies and subsequences of length $L$ containing the
hotspot $(k, s)$, we can characterize the average change in CT as

$$\Gamma_{L,q}^{(k,s)} = \frac{1}{L} \sum_{j=k-L+1}^{k} \frac{1}{E_1 - E_0} \int_{E_0}^{E_1} \left| T_{j,L}(E) - T_{j,L}^{(k,s)}(E) \right|^q dE$$

with $q = 1$ or 2. We find that results for $q = 1$ and 2 are similar. Hence in the manuscript we
restrict our discussion to $q = 2$. We calculate such $\Gamma$ estimates for all $3\mathcal{N}$ possible mutations
of each gene and compare the probability distribution of CT change $\Gamma_{L,q}^{(k,s)}$ for $(k, s) \in M_{\rm all}$ and
$(k, s) \in M_{\rm pa}$ for each gene. The result for the p16 gene was shown in Fig. 2(a) as an example.
As a control group, we also shuffled the p16 sequence randomly under the conditions that (1) the
contents of the 4 bases are not changed, and (2) the positions of the mutations can be moved but
the numbers of the 12 types of mutations are not changed. The distributions of the averaged $\Gamma$ for
the 1D and 2-leg models with $L = 40$ of the 20 shuffled sequences are shown in Fig. S1. It is clear
that the distributions of $\Gamma$ for $M_{\rm all}$ and $M_{\rm pa}$ are almost identical.
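
One way to build such a shuffled control is sketched below in Python. It is an illustration of
the two stated conditions only: the sequence is permuted so that the base composition is
unchanged, and every mutation keeps its type (native base and mutant base) but is moved to a
randomly chosen site carrying the same native base. Positions are 0-based, two mutations may land
on the same site, and the function name `shuffled_control` is hypothetical; the procedure actually
used for the control in Fig. S1 may differ in these details.

```python
import random
from collections import Counter

def shuffled_control(sequence: str, mutations, rng: random.Random):
    """Return a shuffled sequence and relocated mutations.

    `mutations` is a list of (k, s) pairs with 0-based position k and mutant base s.
    """
    bases = list(sequence)
    rng.shuffle(bases)                      # condition (1): base contents unchanged
    shuffled = "".join(bases)
    positions = {}
    for i, base in enumerate(shuffled):
        positions.setdefault(base, []).append(i)
    relocated = []
    for k, s in mutations:
        native = sequence[k]
        new_k = rng.choice(positions[native])   # condition (2): same type, new position
        relocated.append((new_k, s))
    return shuffled, relocated

rng = random.Random(0)
sequence = "GATTACAGGCTTAACCGGTA"
mutations = [(0, "A"), (3, "C"), (7, "T"), (9, "G")]
shuffled, relocated = shuffled_control(sequence, mutations, rng)
```

Computing $\Gamma$ for the relocated mutations on the shuffled sequence then gives a control
distribution with the same base composition and the same mix of mutation types.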

CT Change for the 12 Types of Mutations

The comparison of $\Gamma$ between the pathogenic and all possible mutations for the 12 types of point
mutations is shown in Fig. S2. It is clear that for the 1D model (a-l) $\Gamma$ tends to be smaller for the
pathogenic mutations. However, the difference is not visible for the 2-leg model (m-x).

Local ranking of point mutations at hotspot sites 



In order to study the local effects of pathogenic mutations on CT, we compare Γ^(k,s) of each 
pathogenic mutation (k, s) with the other two non-pathogenic ones at the same position k and 
determine the local ranking (LR) of CT change for (k, s). There are three possibilities of LR, namely 
low, medium and high. Note that hotspots k with more than one pathogenic mutation are 
excluded from the LR analysis. As an example, the percentages of the three LR for the pathogenic 
mutations of p16 are shown in the left panels of Fig. S3. The percentages of pathogenic mutations 
with low CT change are evidently larger than the medium and high ones for all L. Let us again ask 
how significant this tendency is across all 162 genes. Figure S4 shows similar ranking analysis 
results as in Fig. S3, but now for all M_pa. We see that the tendency towards low CT change in the 
pathogenic mutations is quite strong overall. In Fig. [T] we have sorted the LR ranking for each 
gene according to prevalence. We find that for L = 20, 40 and 60 the low CT change corresponds to 
155 (95%), 148 (91%) and 140 (86%) of all 162 genes with pathogenic mutations. The result for 
large CT change is similarly consistent, with only about 30 of all genes having high CT change. 
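
The local ranking can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the analysis code: the names `local_rank` and `lr_percentages` are hypothetical, the CT changes of the three possible substitutions at a site are assumed to be available as a dictionary, and ties are broken by dictionary order.

```python
from collections import Counter

LABELS = ("low", "medium", "high")


def local_rank(gamma_at_site, pathogenic_base):
    """Rank the pathogenic substitution among the three possible ones at a site.

    gamma_at_site   : dict mapping each of the three new bases to its CT change
    pathogenic_base : the new base of the single pathogenic mutation at this site
    """
    # Sort the three substitutions by increasing CT change.
    ordered = sorted(gamma_at_site, key=gamma_at_site.get)
    return LABELS[ordered.index(pathogenic_base)]


def lr_percentages(labels):
    """Percentage of pathogenic mutations in each LR class."""
    counts = Counter(labels)
    return {name: 100.0 * counts[name] / len(labels) for name in LABELS}


# Illustrative numbers only, not data from the paper.
site = {"C": 0.1, "G": 0.5, "T": 0.3}
label = local_rank(site, "C")
```

For a random assignment each class would be expected to hold one third of the mutations, which is the 33% reference line used in the figures.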

Global CT rankings at hotspot sites 

Another way to compare the CT change is a global ranking (GR). We have sorted the CT change 
Γ^(k,s) for all 3N possible mutations of a gene with N bps in order to get a ranking of every 
pathogenic mutation (k, s). By dividing each ranking by 3N we compute the normalised GR γ^(k,s) 
of the mutation, with values between 0 and 1. As before for Γ^(k,s), smaller values of γ^(k,s) mean 
smaller CT change. To characterise the CT change quantitatively, we divide the γ^(k,s) 
of the pathogenic mutations into three groups as before, i.e. low (γ < 33.3%), medium 
(33.3% < γ < 66.7%) and high (γ > 66.7%) CT change. The distribution of the GR for 
the complete set of pathogenic mutations of p16 is shown in Fig. S3 as an example. As for the 
LR results, the pathogenic mutations lead to many γ^(k,s) values with low CT change. This is most 
pronounced in the 1D model as shown in Fig. S3(c). The results of the GR for the 162 genes 
are shown in the bottom row (c) and (d) of Figs. S4 and [B. We see that the GR results are fully 
consistent with the LR rankings. 
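
A minimal sketch of the normalised global ranking follows. The function names are hypothetical, tied values are given the highest rank among equals, and values falling exactly on a group boundary are assigned to the upper group; none of these choices is specified in the text.

```python
from bisect import bisect_right


def normalised_global_rank(all_gammas_sorted, gamma):
    """Rank of `gamma` among all 3N CT changes, divided by 3N.

    all_gammas_sorted : CT changes of all possible mutations, sorted ascending
    gamma             : CT change of the pathogenic mutation of interest
    """
    # Number of mutations with CT change <= gamma is the 1-based rank.
    return bisect_right(all_gammas_sorted, gamma) / len(all_gammas_sorted)


def gr_group(g):
    """Assign a normalised rank to the low, medium or high CT-change group."""
    if g < 1 / 3:
        return "low"
    if g < 2 / 3:
        return "medium"
    return "high"


# Toy gene with N = 2 base pairs, hence 3N = 6 possible mutations (illustrative values).
all_gammas = sorted([0.4, 0.1, 0.6, 0.2, 0.5, 0.3])
g_small = normalised_global_rank(all_gammas, 0.1)
g_mid = normalised_global_rank(all_gammas, 0.3)
g_large = normalised_global_rank(all_gammas, 0.6)
```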

Consistency of CT rankings for all DNA sequences 

The prevalence ordering shown in Fig. [T] does not imply that the order of the genes themselves is 
the same in all parts (a), (b), (c) and (d) of the figure. We have therefore calculated the correlations 
in the ordering and find positive correlation coefficients within both models, across models, and 
for all L = 20, 40 and 60. Hence genes which have a low change in CT for, e.g., 
the local ranking at L = 20 also retain this low rank for the other L values as well as for the global 
ranking. Similarly, these positive correlations imply that in those few cases where the mutations in 
a gene lead to high CT change, they do so across all local as well as global rankings. This confirms 
that our results are internally consistent. 
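
The text does not state which correlation coefficient was used. As one possible illustration, the sketch below computes a Spearman rank correlation between two per-gene orderings, for example the fraction of low-CT-change mutations per gene at L = 20 and at L = 40. All names and numbers here are assumptions for the example.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Illustrative per-gene low-CT fractions for two criteria (not data from the paper).
low_fraction_a = [0.90, 0.50, 0.40, 0.70]
low_fraction_b = [0.80, 0.60, 0.35, 0.75]
rho = spearman(low_fraction_a, low_fraction_b)
```

A positive coefficient for every pair of criteria is what the consistency statement above requires.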

We graphically summarise the results for all 162 disease-related genes in Fig. S5. For each 
gene, a positive deviation from the 0.33 line is shown in orange, supporting the scenario 
of small CT change for pathogenic mutations, and in blue when the results show no or 
negative indication with CT change. The criteria correspond to the local and global ranking results 
for L = 20, 40 and 60 for the 1D and the 2-leg models. Similarly, in Fig. S6 we average over all 
12 criteria and show the resulting overall agreement with the CT hypothesis: 161 of 162 genes are 
above the 33% line and hence show that, for both the 1D and 2-leg models and averaged over lengths 
20, 40 and 60, a small CT change correlates with the existence and position of pathogenic mutations. 
Only for STK11 do we see no overall agreement. 

Differences and similarities between the two models 

The 2-leg model allows inter-strand coupling between the purine bases in successive base pairs, 
in accordance with electronic structure calculations, and should therefore be a better model for 
bulk charge transport along the DNA double helix; the 1D model, by contrast, makes use of the site 
energies of only the bases on the coding strand, and so is most representative of the electronic 
environment along that strand. We also find that the 2-leg model recovers some of the 
coding-strand dependence of the 1D model upon decreasing the diagonal hoppings. For 28 genes, we 
find that reducing only the diagonal hopping elements by 1/2 leads to a much greater agreement 
with the 1D results, similar to Fig. 2(c). 

Figure S1: (Supplementary) Distribution of the change in charge transport Γ in (a) 1D and (b) 2L 
models for pathogenic (orange bars) and all possible (cyan bars) mutations, averaged over the 20 
shuffled p16 (CDKN2A) DNA strands with 26740 base pairs. All results shown are for L = 40; 
data for L = 20 and 60 are similar. 
Figure S2: (Supplementary) Panels a-l: 1D model, results divided into the twelve subtypes of 
mutation. The shift for pathogenic mutations is clearly present in every case. Panels m-x: 2L model, 
results divided into the twelve subtypes of mutation. There is no consistent shift for pathogenic 
mutations. 
Figure S3: (Supplementary) Distribution of the local (a+b) and global (c+d) ranking results of 
pathogenic mutations of p16 (CDKN2A) (blue solid lines) and CYP21A2 (green) as a function of 
window length L. The dashed lines indicate averaged results for 20 randomly shuffled p16 
sequences. The left/right columns distinguish results for the 1D/2-leg models. The dashed horizontal 
line shows the 33% mark expected for a completely random sequence. All lines are guides to the 
eye only. Error bars are within symbol size. 
Figure S4: (Supplementary) Distribution of the local (a+b) and global (c+d) ranking results of all 
19882 pathogenic mutations of the 162 genes as a function of window length L. The left/right 
columns distinguish results for the 1D/2-leg models. The dashed horizontal lines show the 33% 
mark of a completely random sequence. All lines are guides to the eye only. 
Figure S7: Histograms of Γ distributions for (a) transitions and (b) transversions in TP53, 
simulated using the 1D model and L = 20. Histograms are shown for all possible mutations and for 
the pathogenic, silent and intronic subsets. The maximum heights of the populations are scaled to be 
2, 1.5, 1 and 0.5 to ease comparison. The scale factors are indicated by the dotted horizontal lines. 






Table S1: (Supplementary) List of the 162 genes with their lengths 
(bps), number of all point mutations (N_pa), and their numbers of 
the 12 types of point mutations. For example, N_AT means the 
number of A → T substitutions. 



Name      Length  N_pa  N_AT  N_AC  N_AG  N_TA  N_TC  N_TG  N_CA  N_CT  N_CG  N_GA  N_GT  N_GC
ABCA1     147154    87     -     4     9     2     1     2     4    24     3    18     1     1
ABCA4     128313   382    11     9    21    13    51    21    27    73    19    99    23    15
ABCD1      19894   223     8     7    14     6    31     3    15    46    17    47    13    16
ACTA1       2852   164    10     7    22     5    13     6    13    12    11    29    17    19
ACTC1       7631    14     -     1     3     -     -     -     -     4     1     2     1     2
AGA        11668    19     -     -     -     1     3     1     -     2     -     8     2     2
AGT        11673    10     -     -     1     -     1     1     -     5     -     1     -     1
ALB        17127    63     3     2    13     2     1     -     1     6     1    24     4     6
ALDOB      14448    28     -     -     -     1     9     1     3     5     3     3     1     2
AMPD3      56903    11     -     1     -     1     1     -     -     6     1     -     -     1
ANK1      144397    18     -     -     1     -     2     -     1     7     1     4     2     -
APC       108353   222    10     -     4    18     1     8    21    83    28    18    28     3
APOB       42645    51     -     -     2     4     1     1     3    26     2     8     3     1
APOE        3612    33     -     1     1     -     2     -     2     9     2     9     2     5
APRT        2466    13     2     -     1     -     3     -     -     1     -     4     1     1
AR        180246   299    11     6    24    11    31    12    22    53    25    56    31    17
ASAH1      28574    12     1     -     3     1     -     -     1     -     3     1     -     2
ATM       146268   169     8     3    20     9    11    15     5    55    10    19     8     6
ATP7B      78826   315    10    14    25    14    27    10    17    62    16    68    30    22
BCAM       12341    14     1     -     1     1     -     -     1     4     1     5     -     -
BCHE       64562    58     6     2     6     3     6     3     2    12     -     8     5     5
BRCA1      81155   301    12     6    30    14    29    23    12    63    15    38    50     9
BRCA2      84193   162    12     9    20     8    11     8    12    33    13    15    19     2
BRIP1     180771    13     1     -     -     -     1     1     -     3     2     2     1     2
BTK        36741   329    15    14    29    19    47    23    26    44    14    48    32    18
CAPN3      64215   213     2     9    18     5    23     6    10    45    19    48    14    14
CASR      102813   144     2     5    12     4    21     7     8    20    10    38    12     5
CBS        23121   107     2     1     6     4    10     -     4    24     7    39     2     8
CD55       38983    14     -     -     1     2     -     1     -     4     -     3     1     2
CDH1       98250    30     -     1     2     -     2     2     -     9     1     8     4     1
CDKN2A     26740    71     1     3     4     2     6     6     5    12     3    11     8    10
CFC1        6748    10     -     -     -     -     1     -     -     4     1     4     -     -
CF        188699   828    35    31   103    50    85    54    47   117    41   136    84    45
CFH        95494    83     3     3     8     6    10     5     2    10     6    14    13     3
CHEK2      54092    20     1     1     2     -     1     -     2     4     -     7     1     1
COL1A1     17544   292     -     2     2     -     1     2     1    21     4   134    79    46
COL2A1     31538   124     -     1     2     1     1     1     5    26     -    53    19    15
COL4A5    257622   244     2     -     4     2     2     5     4    20     1   117    55    32
COL7A1     31088   265     -     3     6     2     1     -     1    56     7   122    34    33
CPOX       14152    36     -     2     1     -     3     1     -    14     2     9     3     1
CRB1      210178    91     3     1     2     8    16     7     3    11     2    22    11     5
CRX        21483    18     -     1     1     -     -     1     2     4     -     8     -     1
CRYAA       3773    10     -     -     -     -     -     -     1     5     -     3     1     -
CRYGD       2882    12     -     1     -     -     -     -     4     3     1     2     1     -
CYB5R3     30587    35     -     -     3     -     6     2     2    12     -    10     -     -
CYP19A1   129126    13     -     -     -     -     2     1     -     5     -     5     -     -
CYP21A2     3338   102     4     4     5     7     8     4     6    23     2    25     4    10
CYP2A6      6897    12     1     -     1     1     2     -     -     2     -     2     2     1
DPYD      843317    34     2     3     7     2     -     1     2     7     -     5     4     1
DSP        45077    20     -     -     2     1     1     1     -     6     1     6     1     1
ERCC6      80364    18     1     -     2     1     1     1     -    10     1     1     -     -
F10        26731    81     1     4     5     1     6     2     4    11     3    33     5     6
F11        23718   131     2     5     6     3    17     3     9    28     2    29    13    14
F13A1     176614    55     1     -     2     -     6     4     4    12     3    14     8     1
F2         20301    42     -     3     3     -     1     1     -    11     1    17     3     2
F7         14891   164     4     1    13     1    17     4     9    30     6    55    13    11
F8        186936  1168    79    47   124    56   117    78    55   153    72   198   112    77
F9         32723   707    31    26    55    58    69    52    42    54    28   135    95    62
FAH        33342    26     2     1     2     -     1     3     2     6     -     5     4     -
FANCD2     75502    14     -     -     -     -     3     3     -     4     -     4     -     -
FANCG       6179    16     -     -     -     -     2     1     -     7     -     2     2     2
FBN1      237414   640    18    12    52    32    88    37    21    63    32   173    68    44
FECH       38454    49     2     1     2     3     7     3     1    11     1    11     4     3
FGA         7618    45     3     1     3     3     1     2     3    12     2     7     7     1
FLCN       24971    11     -     -     1     -     -     -     -     4     2     3     1     -
FUT1        7380    22     -     -     1     2     2     1     2     5     1     4     1     3
FUT3        8587    11     -     -     -     1     -     2     2     -     -     5     -     1
G6PC       12572    66     2     2     3     2     8     3     3    13     2    15     5     8
G6PD       16182   163     3     3    21     4    15     4     8    27    15    39    13    11
GAMT        4465    11     -     2     -     -     1     -     -     1     1     3     1     2
GBA        10246   259     8    11    25     8    32    19    14    42    10    53    19    18
GCK        45153   255     5    13    15     7    32     8    19    40    11    64    23    18
GH1         1636    35     2     2     7     -     3     1     1     5     2     7     3     2
GJB1       10004   240     4     5    25    18    31    12    10    39    24    39    17    16
GJB2        5513   208     8     9    19     5    28     8    12    23    15    49    19    13
GNAS       71456    51     2     2     2     1     6     2     1    17     4     9     3     2
GPR143     40464    43     2     -     3     2     4     3     4     6     1    10     4     4
HBA1         842    73     2     5     9     2     5     2     7     6     9     8     7    11
HBB         1606   263    15    20    20    21    23    16    22    26    18    38    20    24
HFE         9612    27     1     2     -     -     4     1     -     3     2     1     3     4
HMGCL      23583    27     2     -     2     -     3     1     1     4     1     8     3     2
HSD11B2     6421    24     1     -     1     -     3     2     1    12     1     3     -     -
HSD3B2      7879    32     -     1     1     1     2     3     3     8     3     6     2     2
IDS        26493   203    15     8    15     2    16    13    17    31    19    32    20    15
INS         1431    30     -     -     2     -     3     2     -     3     6     6     4     3
IRS1       64538    14     -     1     3     -     1     -     -     2     1     3     -     2
ITGB3      58870    53     2     2     3     1    10     4     -    12     1    11     5     1
JAG1       36257   131     2     -     3     6    11     6    11    30    12    28    16     6
KAL1      203313    25     -     -     1     2     1     1     -     9     2     6     1     1
KCNE1      65586    17     -     -     1     -     2     -     -     5     -     6     1     1
KCNH2      32966   266     8    11    27     5    19    12    15    61     9    43    35    21
KCNQ1     404120   226     3     2    19     8    24     5    12    44    13    61    11    24
KEL        21303    33     2     -     3     1     3     -     -     9     1    13     -     1
LDHB       22501    11     1     1     1     -     1     2     1     1     -     2     -     1
LDLR       44450   741    23    31    48    31    84    35    51    88    48   168    92    42
LHCGR      68951    37     2     3     3     3     7     3     2     7     1     3     2     1
LIPC      136898    11     -     1     2     -     -     1     -     2     -     4     -     1
MAPT      133924    36     3     2     2     -     3     2     2     6     1     9     5     1
MC1R        2360    24     -     1     1     -     4     -     3     8     -     5     1     1
MEN1        7779   239    10     7     8     9    26    11    19    44    14    38    33    20
MLH1       57359   275    16    15    26    18    19    17    18    42    20    36    28    20
MLH3       37769    17     -     1     5     -     1     -     -     2     1     4     2     1
MSH2       80098   238    16    11    25     8     9    14    11    62    14    30    25    13
MSH6       23872    54     3     1     5     2     3     -     3    17     6     7     4     3
MYH7       22924   268     8    10    20     4    19     8    16    47    17    80    16    23
NF1       282701   338    22     4    24    20    35    26    14    82    24    44    29    14
NF2        95023    72     5     2     5     2     6     1     2    25     4     7    11     2
NPC1L1     28781    26     -     -     3     2     -     -     -    11     1     8     1     -
NR3C1     157582    14     1     -     1     1     4     1     -     1     -     4     -     1
OAT        21580    42     -     -     2     2     4     -     3     9     2    11     5     4
OTC        68968   276    16    11    28     9    31    18    17    36    15    44    27    24
PDE6B      45199    20     1     -     -     3     3     1     2     5     1     4     -     -
PEX1       41509    24     -     -     -     -     4     1     2     7     3     6     -     1
PEX26      11503    10     -     -     -     -     3     -     -     3     2     2     -     -
PEX6       15143    18     -     1     -     -     3     1     1     7     -     5     -     -
PEX7       91337    24     1     2     2     1     1     3     2     6     1     4     1     -
PHKA2      91305    23     -     1     2     -     -     1     1    11     -     5     2     -
PKD1       47189   149     2     3     6     5    12     4     8    59    10    27     8     5
PKD2       70110    35     1     -     1     1     1     1     2    17     -     7     3     1
PKHD1     472279   213     8    10    22     7    29     9     7    50     7    38    17     9
PMS2       35868    21     3     1     1     1     -     -     -     6     -     5     4     -
POU1F1     16954    22     1     -     2     1     3     1     1     6     -     4     2     1
PROC       10802   203     6     6    10     3    21     6    15    40     8    55    13    20
PRSS1       3592    26     1     2     2     2     2     -     2     5     1     5     2     2
PSEN1      83931   154     6     8    13     8    22    11     7    21    12    19    16    11
PSEN2      25532    18     2     2     3     -     -     -     -     5     1     5     -     -
PTCH1      73984    59     3     2     1     2     4     2     7    15     2    11     8     2
PTEN      105338    98     2     2    10     9    13    11     5    15     8    15     6     2
PTS         7595    27     1     -     8     1     -     2     -     6     2     4     2     1
QDPR       57702    20     -     1     2     -     3     3     -     3     -     6     1     1
RB        180388   226     9     8    18    12    16    11    10    38     8    51    28    17
RP2        45418    17     -     -     1     -     1     2     -     5     2     4     2     -
RPE65      21136    42     1     1     2     1     5     3     3     9     1     7     7     2
RPGRIP1    63325    24     -     2     5     1     -     -     -     7     -     5     3     1
RS1        32422    93     3     -     7     5    11     4     5    15     5    19     7    12
RYR1      153865   244     5     4    21     9    20     6    10    56    14    63    17    19
SCN4A      34365    43     1     -     5     2     3     1     4     7     3    12     2     3
SCN5A     101611   226     -     2    18     9    16     6    13    49    10    77    15    11
SERPINA1   12332    29     4     1     -     1     2     1     2     8     1     9     -     -
SERPINA7    3870    16     1     -     -     2     1     -     1     4     -     5     1     1
SLC25A20   41966    11     -     -     1     -     -     -     -     4     1     3     1     1
SLC4A1     18428    65     1     1     3     2     5     -     6    20     4    20     1     2
SMAD4      49535    20     -     1     1     -     -     2     1     6     3     4     1     1
SPTB       76865    18     -     -     2     2     2     2     -     6     1     -     1     2
STK11      22637    62     4     4     2     1     4     5     7    12     5     8     8     2
TAT        10242    11     -     -     -     -     1     1     -     5     1     2     1     -
TERT       41881    30     -     1     3     -     2     -     -    10     3     8     -     2
TG        267939    33     -     1     2     -     2     1     1     7     -    14     4     -
TGFBR2     87641    14     -     -     1     -     1     -     -     5     -     3     1     2
TNNI3       5966    30     -     1     5     -     1     -     -     8     2    10     -     2
TP53       20303  2003   137   113   158   121   142   109   165   284   156   252   202   164
TPI1        3287    11     -     -     1     -     1     -     -     1     -     4     1     2
TSC1       53285    44     2     -     1     -     1     2     5    19     5     5     3     -
TSC2       40724   165     7     4     6     5    13     5    22    48    18    22    10     5
TSHR      190778    45     1     -     3     1     9     2     3     8     1    12     2     3
TTR         6944    98     4     5    10     6    15     9     6     5     1    19    11     7
TYR       117888   205    10    10    22     6    16     6    16    27    13    42    26    11
UROD        3512    45     -     1     2     5     6     2     3     9     2    11     2     2
USH2A     800503    66     -     3     1     1     2     3     6    24     3     8    10     5
VHL        10444   172     5     7    12    13    22    15     7    22    21    17    18    13
WRN       140499    22     3     1     1     1     1     -     -    11     2     1     1     -
WT1        47763    56     1     2     5     1     6     3     3    13     4    11     5     2



